High Performance Computing of Gene Regulatory Networks using a Message-Passing Model
Gene regulatory network reconstruction is a fundamental problem in
computational biology. We recently developed an algorithm, called PANDA
(Passing Attributes Between Networks for Data Assimilation), that integrates
multiple sources of 'omics data and estimates regulatory network models. This
approach was initially implemented in the C++ programming language and has
since been applied to a number of biological systems. In our current research
we are beginning to expand the algorithm to incorporate larger and more diverse
data-sets, to reconstruct networks that contain increasing numbers of elements,
and to build not only single network models, but sets of networks. In order to
accomplish these "Big Data" applications, it has become critical that we
increase the computational efficiency of the PANDA implementation. In this
paper we show how to recast PANDA's similarity equations as matrix operations.
This allows us to implement a highly readable version of the algorithm using
the MATLAB/Octave programming language. We find that the resulting M-code is much
shorter (103 compared to 1128 lines) and more easily modifiable for potential
future applications. The new implementation also runs significantly faster,
with increasing efficiency as the network models increase in size. Tests
comparing the C-code and M-code versions of PANDA demonstrate that this
speed-up is on the order of 20- to 80-fold for networks of similar
dimensions to those we find in current biological applications.
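The abstract does not reproduce the similarity equations, so the following is only a minimal sketch of what recasting a pairwise similarity as matrix operations can look like. It assumes a Tanimoto-style similarity, T(x, y) = x·y / sqrt(|x|² + |y|² − |x·y|), and the helper names `matmul` and `tanimoto_matrix` are hypothetical; none of this is taken from the PANDA source code.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists-of-rows (a is n x k, b is k x m)."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def tanimoto_matrix(a, b):
    """Assumed Tanimoto-style similarity between every row of a and every column of b.

    One matrix product supplies all the dot products at once; the squared norms
    are computed once per row and per column instead of once per pair.
    Pairs where both vectors are all zeros are not handled (division by zero).
    """
    prod = matmul(a, b)                                    # all x . y values
    row_sq = [sum(x * x for x in row) for row in a]        # |x|^2 per row of a
    col_sq = [sum(x * x for x in col) for col in zip(*b)]  # |y|^2 per column of b
    return [
        [
            prod[i][j] / math.sqrt(row_sq[i] + col_sq[j] - abs(prod[i][j]))
            for j in range(len(col_sq))
        ]
        for i in range(len(row_sq))
    ]
```

In an array language such as MATLAB/Octave the same computation collapses to a handful of whole-matrix expressions, which is consistent with the line-count reduction the abstract reports.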
Data reporting standards: making the things we use better
Genomic data often persist far beyond the initial study in which they were generated. But the true value of the data is tied to their being both used and useful, and the usefulness of the data relies intimately on how well annotated they are. While standards such as MIAME have been in existence for nearly a decade, we cannot think that the problem is solved or that we can ignore the need to develop better, more effective methods for capturing the essence of the meta-data that is ultimately required to guarantee utility of the data.
Cascade Size Distributions: Why They Matter and How to Compute Them Efficiently
Cascade models are central to understanding, predicting, and controlling
epidemic spreading and information propagation. Related optimization, including
influence maximization, model parameter inference, or the development of
vaccination strategies, relies heavily on sampling from a model. This is either
inefficient or inaccurate. As an alternative, we present an efficient message
passing algorithm that computes the probability distribution of the cascade
size for the Independent Cascade Model on weighted directed networks and
generalizations. Our approach is exact on trees but can be applied to any
network topology. It approximates locally tree-like networks well, scales to
large networks, and can lead to surprisingly good performance on more dense
networks, as we also exemplify on real-world data.
Comment: Accepted at AAAI 202
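To illustrate why an exact computation is possible on trees, here is a minimal sketch for the simplest case: a rooted directed tree with the root as the only seed. It is not the authors' message-passing algorithm; the function names and the `children` dictionary layout are assumptions made for this example. Given that a node is active, its subtree cascade size is 1 plus, for each child, either 0 (the edge fails) or the child's own subtree cascade size (the edge fires), so the distributions combine by convolution.

```python
def convolve(a, b):
    """Distribution of the sum of two independent non-negative integer variables."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def cascade_size_distribution(children, node):
    """Return d where d[k] = P(cascade size in node's subtree == k | node is active).

    children maps a node to a list of (child, activation_probability) pairs.
    Recursion depth equals tree depth, so very deep trees need an iterative variant.
    """
    dist = [0.0, 1.0]  # the active node itself always contributes exactly 1
    for child, p in children.get(node, []):
        sub = cascade_size_distribution(children, child)
        contrib = [p * x for x in sub]  # edge fires: child's subtree cascade
        contrib[0] += 1.0 - p           # edge fails: child contributes 0
        dist = convolve(dist, contrib)
    return dist
```

For a root with two leaf children, each reached with probability 0.5, `cascade_size_distribution({0: [(1, 0.5), (2, 0.5)]}, 0)` gives probabilities 0.25, 0.5, and 0.25 for sizes 1, 2, and 3.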
Estimating sample-specific regulatory networks
Biological systems are driven by intricate interactions among the complex
array of molecules that comprise the cell. Many methods have been developed to
reconstruct network models of those interactions. These methods often draw on
large numbers of samples with measured gene expression profiles to infer
connections between genes (or gene products). The result is an aggregate
network model representing a single estimate for the likelihood of each
interaction, or "edge," in the network. While informative, aggregate models
fail to capture the heterogeneity that is represented in any population. Here
we propose a method to reverse engineer sample-specific networks from aggregate
network models. We demonstrate the accuracy and applicability of our approach
in several data sets, including simulated data, microarray expression data from
synchronized yeast cells, and RNA-seq data collected from human lymphoblastoid
cell lines. We show that these sample-specific networks can be used to study
changes in network topology across time and to characterize shifts in gene
regulation that may not be apparent in expression data. We believe the ability
to generate sample-specific networks will greatly facilitate the application of
network methods to the increasingly large, complex, and heterogeneous
multi-omic data sets that are currently being generated, and ultimately support
the emerging field of precision network medicine.
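The abstract does not state the estimation formula. One simple way to extract a sample-specific edge from aggregate models is a leave-one-out linear interpolation: compare the aggregate edge weight built from all N samples with the one built without sample q, and scale the difference by N. The sketch below assumes that form; `sample_specific_edge` and `pearson` are hypothetical names, and Pearson correlation stands in for whatever aggregate network method is actually used.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length lists (an example aggregate edge weight)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def sample_specific_edge(aggregate, x, y, q):
    """Assumed leave-one-out estimate of the edge weight for sample q.

    aggregate(x, y) returns one edge weight from per-sample values x and y.
    """
    n = len(x)
    with_all = aggregate(x, y)
    without_q = aggregate(x[:q] + x[q + 1:], y[:q] + y[q + 1:])
    return n * (with_all - without_q) + without_q
```

When the aggregate is a plain average of per-sample contributions, this expression recovers the contribution of sample q exactly; for nonlinear aggregates such as correlation it is only an approximation.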
A High-Throughput DNA Sequence Aligner for Microbial Ecology Studies
As the scope of microbial surveys expands with the parallel growth in sequencing capacity, a significant bottleneck in data analysis is the ability to generate a biologically meaningful multiple sequence alignment. The most commonly used aligners have varying alignment quality and speed, tend to depend on a specific reference alignment, or lack a complete description of the underlying algorithm. The purpose of this study was to create and validate an aligner with the goal of quickly generating a high-quality alignment and having the flexibility to use any reference alignment. Using the simple nearest alignment space termination algorithm, the resulting aligner operates in linear time, requires a small memory footprint, and generates a high-quality alignment. In addition, the alignments generated for variable regions were of as high a quality as the alignment of full-length sequences. As implemented, the method was able to align 18 full-length 16S rRNA gene sequences and 58 V2 region sequences per second to the 50,000-column SILVA reference alignment. Most importantly, the resulting alignments were of a quality equal to SILVA-generated alignments. The aligner described in this study will enable scientists to rapidly generate robust multiple sequence alignments that are implicitly based upon the predicted secondary structure of the 16S rRNA molecule. Furthermore, because the implementation is not connected to a specific database, it is easy to generalize the method to reference alignments for any DNA sequence.
Inferring steady state single-cell gene expression distributions from analysis of mesoscopic samples
BACKGROUND: A great deal of interest has been generated by systems biology approaches that attempt to develop quantitative, predictive models of cellular processes. However, the starting point for all cellular gene expression, the transcription of RNA, has not been described and measured in a population of living cells. RESULTS: Here we present a simple model for transcript levels based on Poisson statistics and provide supporting experimental evidence for genes known to be expressed at high, moderate, and low levels. CONCLUSION: Although the model describes a microscopic process occurring at the level of an individual cell, the supporting data we provide use a small number of cells where the echoes of the underlying stochastic processes can be seen. Not only do these data confirm our model, but this general strategy opens up a potential new approach, Mesoscopic Biology, that can be used to assess the natural variability of processes occurring at the cellular level in biological systems.
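As a rough illustration of why small pooled samples can still show single-cell stochasticity, the sketch below assumes that each cell's transcript count is an independent Poisson variable with the same mean `lam`. That assumption, and the function names, are ours for this example; the abstract does not spell out the model. Under it, the total over n cells is Poisson with mean n·lam, so the relative spread shrinks only as 1/sqrt(n·lam).

```python
import math


def poisson_pmf(k, lam):
    """P(count == k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def pooled_cv(lam, n_cells):
    """Coefficient of variation of the total count pooled over n_cells cells.

    The sum of n independent Poisson(lam) counts is Poisson(n * lam), whose
    standard deviation divided by its mean is 1 / sqrt(n * lam).
    """
    return 1.0 / math.sqrt(lam * n_cells)
```

For a low-abundance transcript with a mean of 4 copies per cell, a pool of 25 cells still shows about 10% relative variation, while a pool of a million cells shows about 0.05%.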